Introduction
Basic terminology
1. Opening Price: The opening price is the price at which a security first trades when an exchange opens for the day.
2. Closing Price: The closing price is the last price at which the stock traded during the regular trading day.
3. High Price: The highest price of the stock in the day.
4. Low Price: The lowest price of the stock in the day.
5. Adjusted Close Price: The adjusted closing price factors in corporate actions such as stock splits, dividends/distributions and rights offerings.
6. Market Capitalization: Market capitalization refers to the total dollar market value of a company’s outstanding shares of stock. It is equal to the product of total shares and the current stock price.
7. Trends: The change in the average value of the data (increasing, decreasing or stationary).
8. Seasonality: Trends that repeat with respect to some specific period of time.
9. Cyclical: Trends with no set repetition.
10. Stock Market Index: A measurement of a section of the stock market, computed from the prices of selected stocks (typically a weighted average). It is a tool used by investors and financial managers to describe the market and to compare the returns of specific investments. There are many stock market indices, but the most important of them is the S&P 500.
S&P 500
- A stock market index that tracks 500 large-cap U.S. companies.
- It represents stock market performance by reporting the risks and returns of 500 large-cap U.S. companies.
- It measures public shares only, not those held by control groups, other companies or governments.
- The index is weighted by float-adjusted market capitalization.
- The index’s 500 companies are selected on the basis of liquidity, size and industry.
- The index is rebalanced quarterly, in March, June, September and December.
Note: Other popular indices include the S&P MidCap 400, which represents the mid-cap range of companies and the S&P SmallCap 600, which represents small-cap companies.
Requirements to qualify for the index
- The company must be in the US and have a market cap of at least $8.2 billion.
- At least 50% of the company’s stock must be public.
- Its stock price must be at least $1 per share.
- At least 50% of its fixed assets and revenues must be in the USA.
- It must have at least four consecutive quarters of positive earnings.
- The stock must be listed on the New York Stock Exchange, Investors Exchange, NASDAQ, or BATS. It cannot be over-the-counter or listed on pink sheets.
Weighting
The S&P 500 uses a market capitalization weighting method, giving a higher percentage allocation to companies with the largest market capitalizations.
It gives investors a sense of how much a change in a company’s stock price will affect the index.
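As a toy sketch of cap weighting (the company names and market caps below are made up for illustration, not real data):

```python
# Hypothetical float-adjusted market caps, in dollars.
caps = {"A": 2000e9, "B": 1000e9, "C": 500e9, "D": 500e9}

# Each company's index weight is its cap divided by the total cap.
total_cap = sum(caps.values())
weights = {name: cap / total_cap for name, cap in caps.items()}

print(weights)  # A carries 50% of the index, B 25%, C and D 12.5% each
```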
Index calculation
The index is calculated by:
\(\text{Index}=\frac{\sum_i P_i S_i}{D}\)
where \(P_i\) is the price of each stock, \(S_i\) is the number of publicly available shares, and \(D\) is a constant called the divisor.
The divisor is considered to be proprietary to the firm; however, its value is approximately 8.9 billion.
The divisor is adjusted in case of corporate actions to ensure that they don’t affect the value of the index.
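A minimal sketch of this formula, with made-up prices and share counts and the approximate divisor mentioned above:

```python
# Index = sum(P_i * S_i) / D, with hypothetical P_i and S_i.
prices = [150.0, 300.0, 50.0]   # P_i: price of each stock
shares = [1e9, 0.5e9, 2e9]      # S_i: publicly available (float) shares
divisor = 8.9e9                 # D: approximate value of the S&P divisor

index_value = sum(p * s for p, s in zip(prices, shares)) / divisor
print(round(index_value, 2))    # total float cap of $400B / 8.9B -> about 44.94
```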
Portfolio
In finance, a portfolio is a collection of investments held by an investment company, hedge fund, financial institution or individual.
Create Portfolio
import pandas as pd
start=pd.to_datetime("2014-01-01")
end=pd.to_datetime("2019-01-01")
Grabbing Data
Before grabbing data we need to know the stock symbols, also called tickers; e.g., for Amazon it is AMZN.
import bs4 as bs
import pickle
import requests
def save_sp500_tickers(path=''):
    """
    Return the tickers of the S&P 500 and save them in a pickle.
    Parameters
    path: path to save the tickers.
    """
    resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': "wikitable sortable"})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text.replace("\n", '')
        ticker = ticker.replace(".", "-")  # e.g. BRK.B -> BRK-B, as data APIs expect
        tickers.append(ticker)
    with open(path + 'sp500tickers.pickle', 'wb') as f:
        pickle.dump(tickers, f)
    return tickers

tickers = save_sp500_tickers()
Data can be grabbed with pandas_datareader.DataReader() or quandl.get().
pandas_datareader.DataReader() provides all data for free but does not work in some geographical locations.
quandl.get() provides limited free data.
import datetime as dt
import os
import pandas_datareader as web
import numpy as np
import quandl
import matplotlib.pyplot as plt
%matplotlib notebook
#Grabbing a bunch of tech stocks for our portfolio
aapl = quandl.get('WIKI/AAPL.11',start_date=start,end_date=end)
cisco = quandl.get('WIKI/CSCO.11',start_date=start,end_date=end)
ibm = quandl.get('WIKI/IBM.11',start_date=start,end_date=end)
amzn = quandl.get('WIKI/AMZN.11',start_date=start,end_date=end)
#saving data for future use
aapl.to_csv('AAPL_CLOSE')
cisco.to_csv('CISCO_CLOSE')
ibm.to_csv('IBM_CLOSE')
amzn.to_csv('AMZN_CLOSE')
To get data for all S&P 500 companies I have used the function below.
def get_data(tickers, start, end, path, get_ticker=False, api="yahoo"):
    """
    Save data of all S&P 500 companies.
    Parameters
    tickers: list of tickers
    start: tuple, start date of dataframe (year, month, day)
    end: tuple, end date of dataframe (year, month, day)
    path: string, path to save the data frames
    get_ticker: boolean, if True scrape the tickers first
    api: string, name of the API to get data from
    """
    if get_ticker:
        tickers = save_sp500_tickers()
    if not os.path.exists(path):
        os.makedirs(path)
    start = dt.datetime(*start)
    end = dt.datetime(*end)
    for i, ticker in enumerate(tickers, 1):
        if not os.path.exists(path + '/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, api, start, end)
            df.to_csv(path + '/{}.csv'.format(ticker))
        else:
            print("{}.csv already exists".format(ticker))
        if i % 20 == 0:
            print(str((i / len(tickers)) * 100) + " % completed")
def compile_data(tickers, path):
    """
    Return a combined dataframe with the Adj Close columns of all S&P 500 data frames.
    Parameters
    tickers: list of tickers
    path: path of the folder where the data frames are saved
    """
    main_df = pd.DataFrame()
    for count, ticker in enumerate(tickers):
        df = pd.read_csv(path + "/{}.csv".format(ticker))
        df.set_index("Date", inplace=True)
        df.rename(columns={"Adj Close": ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')
        if count % 100 == 0:
            print(str((count / len(tickers)) * 100) + " % completed")
    main_df.to_csv(path + '/sp500_joined_closes.csv')
    return main_df
To check and visualize the correlation between the stocks I picked for my portfolio:
def visulize_data(df, figsize):
    """
    Plot a heatmap of the correlations between the columns of df.
    """
    df_corr = df.corr()
    data = df_corr.values
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111)
    heatmap = ax.pcolor(data, cmap=plt.cm.RdYlBu)
    fig.colorbar(heatmap)
    ax.set_xticks(np.arange(data.shape[0]) + 0.5, minor=False)
    ax.set_yticks(np.arange(data.shape[1]) + 0.5, minor=False)
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    column_labels = df_corr.columns
    row_labels = df_corr.index
    ax.set_xticklabels(column_labels)
    ax.set_yticklabels(row_labels)
    plt.xticks(rotation=90)
    heatmap.set_clim(-1, 1)
df=pd.concat([aapl,cisco,ibm,amzn],1)
df.columns=['AAPl','CISCO','IBM','AMZN']
df.tail()
| AAPl | CISCO | IBM | AMZN | |
|---|---|---|---|---|
| Date | ||||
| 2018-03-21 | 171.270 | 44.31 | 156.69 | 1581.86 |
| 2018-03-22 | 168.845 | 43.07 | 152.09 | 1544.10 |
| 2018-03-23 | 164.940 | 42.42 | 148.89 | 1495.56 |
| 2018-03-26 | 172.770 | 44.06 | 153.37 | 1555.86 |
| 2018-03-27 | 168.340 | 42.68 | 151.91 | 1497.05 |
visulize_data(df,(5,5))

To check the volatility of your stocks, look at the KDE plots of their daily percentage changes.
df[['AAPl','CISCO',"IBM","AMZN"]].pct_change(1).plot(kind='kde')
<matplotlib.axes._subplots.AxesSubplot at 0x7f79b4709f50>

Normalize values
We have three methods to normalize values, but cumulative return is used most often.
Daily Percentage Change
- Returns the percentage gain or loss w.r.t. the previous data point.
- Helpful in analyzing volatility.
- Volatility is given by the standard deviation of the daily percentage change column.
- Use the pandas `shift` method or the `pct_change` method.
aapl['return']=((aapl['Adj. Close']/aapl['Adj. Close'].shift(1))-1)
aapl['return'].head()
Date
2014-01-02 NaN
2014-01-03 -0.021966
2014-01-06 0.005453
2014-01-07 -0.007156
2014-01-08 0.006338
Name: return, dtype: float64
Daily return
- Returns the profit/loss made by the stock compared to the previous day.
- Values above 1 indicate profit and values below 1 indicate loss.
aapl.fillna(0,inplace=True)
(aapl['return']+1).head()
Date
2014-01-02 1.000000
2014-01-03 0.978034
2014-01-06 1.005453
2014-01-07 0.992844
2014-01-08 1.006338
Name: return, dtype: float64
Cumulative Daily return
- Returns the profit/loss relative to a specific date.
- Values above 1 indicate profit and values below 1 indicate loss.
- Use the `.cumprod` method.
aapl['Cumulative_returns']=(1+aapl['return']).cumprod()
aapl['Cumulative_returns'].head()
Date
2014-01-02 1.000000
2014-01-03 0.978034
2014-01-06 0.983367
2014-01-07 0.976330
2014-01-08 0.982518
Name: Cumulative_returns, dtype: float64
# doing same for all other stocks
for stock_df in (cisco, ibm, amzn):
    stock_df['return'] = (stock_df['Adj. Close'] / stock_df['Adj. Close'].shift(1)) - 1
    stock_df.fillna(0, inplace=True)
    stock_df['Cumulative_returns'] = (1 + stock_df['return']).cumprod()
ibm.head()
| Adj. Close | return | Cumulative_returns | |
|---|---|---|---|
| Date | |||
| 2014-01-02 | 162.670896 | 0.000000 | 1.000000 |
| 2014-01-03 | 163.644133 | 0.005983 | 1.005983 |
| 2014-01-06 | 163.082988 | -0.003429 | 1.002533 |
| 2014-01-07 | 166.335879 | 0.019946 | 1.022530 |
| 2014-01-08 | 164.810264 | -0.009172 | 1.013152 |
Allocations
The weights with which you invest in individual stocks relative to your total investment.
For this case let's take:
- 30% AAPL
- 20% CISCO
- 40% AMZN
- 10% IBM
Let's get our adjusted normed values by multiplying them with their respective allocations.
for stock_df, allo in zip([aapl, cisco, ibm, amzn], [.3, .2, .4, .1]):
    stock_df['Allocation'] = stock_df['Cumulative_returns'] * allo
aapl.head()
| Adj. Close | return | Cumulative_returns | Allocation | |
|---|---|---|---|---|
| Date | ||||
| 2014-01-02 | 73.523423 | 0.000000 | 1.000000 | 0.300000 |
| 2014-01-03 | 71.908415 | -0.021966 | 0.978034 | 0.293410 |
| 2014-01-06 | 72.300536 | 0.005453 | 0.983367 | 0.295010 |
| 2014-01-07 | 71.783135 | -0.007156 | 0.976330 | 0.292899 |
| 2014-01-08 | 72.238063 | 0.006338 | 0.982518 | 0.294755 |
Let's assume that we invested a total of one million dollars.
for stock_df in [aapl, cisco, ibm, amzn]:
    stock_df['Position Values'] = stock_df['Allocation'] * 1000000
portfolio_val = pd.concat([aapl['Position Values'],cisco['Position Values'],ibm['Position Values'],amzn['Position Values']],axis=1)
portfolio_val.columns = ['AAPL','CISCO','IBM','AMZN']
portfolio_val['Total Pos'] = portfolio_val.sum(axis=1)
portfolio_val.head()
| AAPL | CISCO | IBM | AMZN | Total Pos | |
|---|---|---|---|---|---|
| Date | |||||
| 2014-01-02 | 300000.000000 | 200000.000000 | 400000.000000 | 100000.000000 | 1.000000e+06 |
| 2014-01-03 | 293410.229060 | 199818.181818 | 402393.143966 | 99615.548911 | 9.952371e+05 |
| 2014-01-06 | 295010.214597 | 200090.909091 | 401013.313211 | 98909.465538 | 9.950239e+05 |
| 2014-01-07 | 292899.047240 | 202818.181818 | 409012.019619 | 100015.076513 | 1.004744e+06 |
| 2014-01-08 | 294755.301647 | 202663.636364 | 405260.604754 | 100992.537126 | 1.003672e+06 |
fig=plt.figure()
portfolio_val['Total Pos'].plot(figsize=(8,5))
plt.title('Total Portfolio Value')
Text(0.5, 1.0, 'Total Portfolio Value')

portfolio_val.drop('Total Pos',axis=1).plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot at 0x7f79b16471d0>

Sharpe Ratio
The Sharpe Ratio is a measure for calculating risk-adjusted return:
\(SR = \frac{\mu - rf\_r}{\sigma}\)
where \(\mu\) is the mean portfolio percentage daily return, \(\sigma\) is the standard deviation of the portfolio percentage daily return and \(rf\_r\) is the risk-free rate.
Annualized Sharpe ratio = \(k \times\) Sharpe ratio
Values of \(k\) for different sampling frequencies:
- Daily = \(\sqrt{252}\) (number of trading days in a year; the code below uses \(\sqrt{225}\))
- Weekly = \(\sqrt{52}\)
- Monthly = \(\sqrt{12}\)
The risk-free rate is the interest rate of a bank. If \(x\%\) is the yearly interest rate of the bank, then the risk-free rate for daily returns is \((1 + x/100)^{1/252} - 1\).
For most countries it is approximately zero.
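The yearly-to-daily conversion can be sketched as follows (assuming 252 trading days and daily compounding; the 2% yearly rate is a made-up example):

```python
x = 2.0  # hypothetical yearly bank interest rate, in percent

# De-compound the yearly rate into an equivalent daily rate.
rf_daily = (1 + x / 100) ** (1 / 252) - 1

# For small x this is close to the simple approximation x / (252 * 100).
approx = x / (252 * 100)
print(rf_daily, approx)
```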
portfolio_val['Daily_return']=portfolio_val['Total Pos'].pct_change(1)
sr=portfolio_val['Daily_return'].mean()/portfolio_val['Daily_return'].std()
sr
0.02971216761205935
portfolio_val['Daily_return'].head()
Date
2014-01-02 NaN
2014-01-03 -0.004763
2014-01-06 -0.000214
2014-01-07 0.009769
2014-01-08 -0.001067
Name: Daily_return, dtype: float64
asr=(225**0.5)*sr
asr
0.4456825141808902
Portfolio Optimization
Modern Portfolio Theory (MPT), a hypothesis put forth by Harry Markowitz in his paper “Portfolio Selection” (published in 1952 by the Journal of Finance), is an investment theory based on the idea that risk-averse investors can construct portfolios to optimize or maximize expected return for a given level of market risk, emphasizing that risk is an inherent part of higher reward.
According to MPT,
\(r_p=\sum_i \omega_i r_i\)
where \(r_p\) is the return of the portfolio, \(r_i\) is the return on an individual asset and \(\omega_i\) is the weight of asset \(i\).
and
\(\sigma_p^2=\sum_i \omega_i^2 \sigma_i^2 +\sum_i \sum_{j\neq i}\omega_i \omega_j\sigma_i \sigma_j\rho_{ij}\)
where \(\sigma_i^2\) is the variance of the returns of asset \(i\) and \(\rho_{ij}\) is the correlation between the returns of assets \(i\) and \(j\).
Now it can be written as
\(\sigma_p^2=\sum_i \sum_{j}\omega_i \omega_j\sigma_i \sigma_j\rho_{ij}\)
where \(\rho_{ij}=1\) if \(i=j\).
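The double sum is just the quadratic form \(\sigma_p^2 = w^{T}\Sigma w\), where \(\Sigma\) is the covariance matrix of the asset returns. A small numpy sketch with synthetic returns (the numbers are randomly generated, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.001, 0.01, size=(250, 3))  # synthetic daily returns, 3 assets
weights = np.array([0.5, 0.3, 0.2])               # portfolio weights, sum to 1

cov = np.cov(returns, rowvar=False)  # covariance matrix Sigma (3x3)
port_var = weights @ cov @ weights   # sigma_p^2 = w^T Sigma w

# The explicit double sum from the formula gives the same number.
n = len(weights)
double_sum = sum(weights[i] * weights[j] * cov[i, j]
                 for i in range(n) for j in range(n))
print(np.isclose(port_var, double_sum))  # True
```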
For portfolio optimization we have different methods.
Monte Carlo Simulation
It randomly simulates many different portfolio weightings and finally arrives at an optimal portfolio balance.
stocks = pd.concat([aapl["Adj. Close"],cisco["Adj. Close"],ibm["Adj. Close"],amzn["Adj. Close"]],axis=1)
stocks.columns = ['aapl','cisco','ibm','amzn']
stocks.head()
| aapl | cisco | ibm | amzn | |
|---|---|---|---|---|
| Date | ||||
| 2014-01-02 | 73.523423 | 19.451083 | 162.670896 | 397.97 |
| 2014-01-03 | 71.908415 | 19.433400 | 163.644133 | 396.44 |
| 2014-01-06 | 72.300536 | 19.459924 | 163.082988 | 393.63 |
| 2014-01-07 | 71.783135 | 19.725166 | 166.335879 | 398.03 |
| 2014-01-08 | 72.238063 | 19.710136 | 164.810264 | 401.92 |
# average daily return
stocks.pct_change(1).mean()
aapl 0.000882
cisco 0.000819
ibm 0.000008
amzn 0.001418
dtype: float64
#average yearly return
stocks.pct_change(1).mean()*252
aapl 0.222314
cisco 0.206357
ibm 0.001961
amzn 0.357435
dtype: float64
stocks.pct_change(1).corr()
| aapl | cisco | ibm | amzn | |
|---|---|---|---|---|
| aapl | 1.000000 | 0.421112 | 0.323143 | 0.351132 |
| cisco | 0.421112 | 1.000000 | 0.460829 | 0.325355 |
| ibm | 0.323143 | 0.460829 | 1.000000 | 0.254723 |
| amzn | 0.351132 | 0.325355 | 0.254723 | 1.000000 |
visulize_data(stocks.pct_change(1),(6,6))

stock_normed=stocks/stocks.iloc[0]
stock_normed.plot();

stock_daily_return=stocks.pct_change(1)
stock_daily_return.head()
| aapl | cisco | ibm | amzn | |
|---|---|---|---|---|
| Date | ||||
| 2014-01-02 | NaN | NaN | NaN | NaN |
| 2014-01-03 | -0.021966 | -0.000909 | 0.005983 | -0.003845 |
| 2014-01-06 | 0.005453 | 0.001365 | -0.003429 | -0.007088 |
| 2014-01-07 | -0.007156 | 0.013630 | 0.019946 | 0.011178 |
| 2014-01-08 | 0.006338 | -0.000762 | -0.009172 | 0.009773 |
Note: we will switch to logarithmic returns as they detrend and normalize the data.
To avoid warnings due to 0 values in the np.log function, we apply the log to each series individually and then subtract.
log_return=np.log(stocks)-np.log(stocks.shift(1))
log_return.hist(bins=100,figsize=(12,6));
plt.tight_layout()

log_return.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| aapl | 1062.0 | 0.000758 | 0.014378 | -0.083303 | -0.005870 | 0.000493 | 0.008271 | 0.078794 |
| cisco | 1064.0 | 0.000739 | 0.012646 | -0.075177 | -0.005342 | 0.000644 | 0.007255 | 0.092034 |
| ibm | 1064.0 | -0.000064 | 0.012026 | -0.073791 | -0.005876 | 0.000128 | 0.006143 | 0.084934 |
| amzn | 1062.0 | 0.001245 | 0.018563 | -0.116503 | -0.006951 | 0.001106 | 0.010345 | 0.132178 |
log_return.cov()
| aapl | cisco | ibm | amzn | |
|---|---|---|---|---|
| aapl | 0.000207 | 0.000077 | 0.000056 | 0.000094 |
| cisco | 0.000077 | 0.000160 | 0.000070 | 0.000077 |
| ibm | 0.000056 | 0.000070 | 0.000145 | 0.000057 |
| amzn | 0.000094 | 0.000077 | 0.000057 | 0.000345 |
visulize_data(log_return,(6,6))

np.random.seed(101)
Basic Monte Carlo simulation using numpy:
def monte_carlo_simulation(num_iter, log_ret):
    """
    Return sharpe_ratio array, weights array, volatility array and yearly_log_return array.
    Parameters:
    num_iter: number of iterations
    log_ret: log return data frame of the stocks
    """
    weights_mat = np.zeros((num_iter, len(log_ret.columns)))
    returns = np.zeros(num_iter)
    volatility = np.zeros(num_iter)
    sharpe_ratio = np.zeros(num_iter)
    for ind in range(num_iter):
        weights = np.random.random(len(log_ret.columns))
        weights = weights / np.sum(weights)  # normalize so the weights sum to 1
        weights_mat[ind, :] = weights
        returns[ind] = np.sum((log_ret.mean() * weights) * 252)
        volatility[ind] = np.sqrt(np.dot(weights.T, np.dot(log_ret.cov() * 252, weights)))
        sharpe_ratio[ind] = returns[ind] / volatility[ind]
    print("Max Sharpe ratio: " + str(sharpe_ratio.max()))
    ind = sharpe_ratio.argmax()
    print("Weights: ")
    for i in range(len(log_ret.columns)):
        print(str(log_ret.columns[i]) + ": " + str(weights_mat[ind][i]))
    return sharpe_ratio, weights_mat, volatility, returns
sr,weights,volatility,returns=monte_carlo_simulation(10000,log_return)
Max Sharpe ratio: 1.2621717668683905
Weights:
aapl: 0.24413890337217273
cisco: 0.36481460222702805
ibm: 0.0006651961521593167
amzn: 0.39038129824864
Plotting Volatility and returns
plt.figure(figsize=(8,8))
plt.scatter(volatility,returns,c=sr,cmap='plasma')
plt.colorbar(label='Sharpe Ratio')
plt.xlabel('Volatility')
plt.ylabel('Return')
# Add red dot for max SR
plt.scatter(volatility[sr.argmax()],returns[sr.argmax()],c='red',s=50,edgecolors='black')
<matplotlib.collections.PathCollection at 0x7f79b13b48d0>

Mathematical Optimization
Using the minimize method of scipy.optimize, which uses Sequential Least Squares Programming (SLSQP).
from scipy.optimize import minimize
First we need the random weights:
def get_weights(num_weights, num_iter):
    weights_mat = np.zeros((num_iter, num_weights))
    for ind in range(num_iter):
        weights = np.random.random(num_weights)
        weights = weights / np.sum(weights)
        weights_mat[ind, :] = weights
    return weights_mat
weights=get_weights(4,10000)
def get_ret_vol_sr(weights, log_return):
    """
    Return an array of yearly return, volatility and Sharpe ratio.
    """
    ret = np.sum(log_return.mean() * weights) * 252
    vol = np.sqrt(np.dot(weights.T, np.dot(log_return.cov() * 252, weights)))
    sr = ret / vol
    return np.array([ret, vol, sr])
Since we are trying to maximize the Sharpe ratio but scipy.optimize only offers a minimize method, we minimize the negative of the Sharpe ratio, which maximizes the Sharpe ratio itself.
def neg_sharpe(weights):
    return get_ret_vol_sr(weights, log_return)[2] * -1
Here the constraint is \(\sum_{i=1}^n \omega_i=1\), so we need to check that:
def check_constraint(weights):
    """Return 0 if the constraint is satisfied."""
    return np.sum(weights) - 1
# By convention of the minimize function, an equality constraint is a function that returns zero when satisfied
cons = ({'type':'eq','fun': check_constraint})
For any weight \(\omega_i\) we have \(0\leq \omega_i \leq 1\).
# adding bounds for each weights
bounds=((0,1),(0,1),(0,1),(0,1))
# initial weights (equal allocation)
init=[0.25,0.25,0.25,0.25]
result=minimize(neg_sharpe,init,method="SLSQP",bounds=bounds,constraints=cons)
result
fun: -1.2631467864916681
jac: array([-1.19030476e-04, -2.96950340e-04, 6.49340272e-01, 3.66181135e-04])
message: 'Optimization terminated successfully.'
nfev: 30
nit: 5
njev: 5
status: 0
success: True
x: array([2.21702902e-01, 3.89945001e-01, 2.71267384e-16, 3.88352096e-01])
result.x
array([2.21702902e-01, 3.89945001e-01, 2.71267384e-16, 3.88352096e-01])
get_ret_vol_sr(result.x,log_return)
array([0.23680603, 0.18747309, 1.26314679])
Efficient Frontier
The efficient frontier is the set of optimal portfolios that offers the highest expected return for a defined level of risk or the lowest risk for a given level of expected return.
# The max return is about 0.23, so keep the range between 0 and 0.3
frontier_y=np.linspace(0,0.3,100)
def minimize_volatility(weights):
    return get_ret_vol_sr(weights, log_return)[1]
frontier_volatility = []
for possible_return in frontier_y:
    # function for return
    cons = ({'type':'eq','fun': check_constraint},
            {'type':'eq','fun': lambda w: get_ret_vol_sr(w, log_return)[0] - possible_return})
    result = minimize(minimize_volatility,init,method='SLSQP',bounds=bounds,constraints=cons)
    frontier_volatility.append(result['fun'])
plt.figure(figsize=(8,8))
plt.scatter(volatility,returns,c=sr,cmap='plasma')
plt.colorbar(label='Sharpe Ratio')
plt.xlabel('Volatility')
plt.ylabel('Return')
# Add frontier line
plt.plot(frontier_volatility,frontier_y,'g--',linewidth=3)
[<matplotlib.lines.Line2D at 0x7f79b12d39d0>]

Types of Funds
Exchange Traded Funds
An ETF is an investment fund traded on stock exchanges. It consists of a basket of stocks, bonds, commodities, etc., which ETF distributors buy or sell to authorized participants (large broker-dealers with whom they have entered into an agreement). Their holdings are public and transparent. Individuals can buy or sell this marketable security through brokers just like stocks. The expense ratio lies between \(0.01\% - 1\%\). The best-known ETF is the SPDR S&P 500 ETF (SPY), which tracks the S&P 500.
Mutual Funds
A mutual fund is a professionally managed investment fund that pools money from many investors to invest in securities. Its portfolio is structured and maintained to match the investment objectives stated in its prospectus. Mutual funds mainly disclose their holdings once a quarter. The expense ratio lies between \(0.5\%-3\%\). Liquidity is commonly restricted to buying/selling at the end of the day through a broker.
Hedge Funds
A hedge fund is an alternative investment fund that pools money from certain investors (accredited investors) to invest in securities, using numerous portfolio constructions, trading strategies and risk management techniques to earn high active returns for investors. Hedge funds don't disclose information regarding their trading techniques or investments to the public or even their own investors. The common fee structure is \(2\%\) of the fund plus \(20\%\) of profits. Liquidity depends on the agreement.
Capital Asset Pricing Model
Theorem
- In a fair market \(r_i(t)=\beta_i r_m(t)+\alpha_i\), where \(r_i\) is the return of an asset, \(r_m\) is the return of the market, and \(\beta_i\) and \(\alpha_i\) are constants. \(\alpha_i\) is almost equal to zero and not predictable.
So as per this theorem we cannot beat a fair market; we assume that the theorem does not hold exactly in the real world and try to predict the value of \(\alpha\) with confidence \(> 50\%\) so that it yields acceptable results.
Let's take a portfolio with return \(r_p\); then
\(r_p(t)=\sum_i \omega_i r_i(t)\)
Applying CAPM to the portfolio we have
\(r_p(t)=\beta_p r_m(t)+\alpha_p\), with \(\beta_p=\sum_i \omega_i \beta_i\) and \(\alpha_p=\sum_i \omega_i \alpha_i\).
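An equivalent way to estimate \(\beta\) and \(\alpha\) (instead of the linear regression used below) is \(\beta_i=\operatorname{cov}(r_i,r_m)/\operatorname{var}(r_m)\) and \(\alpha_i=\bar r_i-\beta_i \bar r_m\). A sketch on synthetic returns (both series are generated, not market data):

```python
import numpy as np

rng = np.random.default_rng(42)
r_m = rng.normal(0.0005, 0.01, 1000)           # synthetic daily market returns
r_i = 1.2 * r_m + rng.normal(0, 0.005, 1000)   # asset constructed with beta = 1.2

# beta = cov(r_i, r_m) / var(r_m); alpha = mean residual return
beta = np.cov(r_i, r_m, ddof=1)[0, 1] / np.var(r_m, ddof=1)
alpha = r_i.mean() - beta * r_m.mean()
print(beta, alpha)  # beta lands near the true value of 1.2, alpha near 0
```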
from scipy import stats
spy = web.DataReader('SPY','yahoo')
spy.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2015-02-02 to 2020-01-31
Data columns (total 6 columns):
High 1259 non-null float64
Low 1259 non-null float64
Open 1259 non-null float64
Close 1259 non-null float64
Volume 1259 non-null float64
Adj Close 1259 non-null float64
dtypes: float64(6)
memory usage: 68.9 KB
spy.tail()
| High | Low | Open | Close | Volume | Adj Close | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2020-01-27 | 325.119995 | 322.660004 | 323.029999 | 323.500000 | 84062500.0 | 323.500000 |
| 2020-01-28 | 327.850006 | 323.600006 | 325.059998 | 326.890015 | 63834000.0 | 326.890015 |
| 2020-01-29 | 328.630005 | 326.399994 | 328.380005 | 326.619995 | 53888900.0 | 326.619995 |
| 2020-01-30 | 327.910004 | 323.540009 | 324.359985 | 327.679993 | 75491800.0 | 327.679993 |
| 2020-01-31 | 327.170013 | 320.730011 | 327.000000 | 321.730011 | 113688600.0 | 321.730011 |
start = pd.to_datetime('2015-02-02')
end = pd.to_datetime('2020-01-31')
aapl = web.DataReader('AAPL','yahoo',start,end)
aapl.tail()
| High | Low | Open | Close | Volume | Adj Close | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2020-01-27 | 311.769989 | 304.880005 | 310.059998 | 308.950012 | 40485000.0 | 308.950012 |
| 2020-01-28 | 318.399994 | 312.190002 | 312.600006 | 317.690002 | 40558500.0 | 317.690002 |
| 2020-01-29 | 327.850006 | 321.380005 | 324.450012 | 324.339996 | 54057300.0 | 324.339996 |
| 2020-01-30 | 324.089996 | 318.750000 | 320.540009 | 323.869995 | 31685800.0 | 323.869995 |
| 2020-01-31 | 322.679993 | 308.290009 | 320.929993 | 309.510010 | 49750700.0 | 309.510010 |
aapl['Cumulative'] = aapl['Close']/aapl['Close'].iloc[0]
spy['Cumulative'] = spy['Close']/spy['Close'].iloc[0]
plt.figure()
aapl['Cumulative'].plot(label='AAPL',figsize=(8,8))
spy['Cumulative'].plot(label='SPY Index')
plt.legend()
plt.title('Cumulative Return')
Text(0.5, 1.0, 'Cumulative Return')

aapl['Daily Return'] = aapl['Close'].pct_change(1)
spy['Daily Return'] = spy['Close'].pct_change(1)
plt.figure()
plt.scatter(aapl['Daily Return'],spy['Daily Return'],alpha=0.3)
<matplotlib.collections.PathCollection at 0x7f79b1430e50>

plt.figure()
aapl['Daily Return'].hist(bins=100,label="AAPL");
spy['Daily Return'].hist(bins=100,label="SPY",alpha=0.6);

beta,alpha,r_value,p_value,std_err = stats.linregress(aapl['Daily Return'].iloc[1:],spy['Daily Return'].iloc[1:])
print("Beta: {} \nAlpha : {} \nR Value : {} \nP Value: {} \nStd:{}".format(beta,alpha,r_value,p_value,std_err))
Beta: 0.37024455075001905
Alpha : 7.901433583546179e-05
R Value : 0.6807275590391761
P Value: 5.554999957679808e-172
Std:0.011242163556337528
Here the R value represents the confidence of the fit; 0.68 is low. Let's take fake data that moves almost exactly with SPY and check the R value for that.
noise = np.random.normal(0,0.001,len(spy['Daily Return'].iloc[1:]))
beta,alpha,r_value,p_value,std_err = stats.linregress(spy['Daily Return'].iloc[1:]+noise,spy['Daily Return'].iloc[1:])
print("Beta: {} \nAlpha : {} \nR Value : {} \nP Value: {} \nStd:{}".format(beta,alpha,r_value,p_value,std_err))
Beta: 0.9809402276980321
Alpha : 3.7134513497252776e-05
R Value : 0.9930827588166153
P Value: 0.0
Std:0.003272588565826361
OHLC Charts
Candlestick plot
- A candlestick is a type of price chart that displays the high, low, open, and closing prices of a security for a specific period.
- The wide part of the candlestick is called the “real body” and tells investors whether the closing price was higher or lower than the opening price (black/red if the stock closed lower, white/green if the stock closed higher) as shown in figure.
- Candlesticks are a suitable technique for trading any liquid financial asset such as stocks, foreign exchange and futures.

- Long white/green candlesticks indicate strong buying pressure; this typically means the price is bullish.
- Long black/red candlesticks indicate there is significant selling pressure. This suggests the price is bearish.
import matplotlib.dates as mdates
import mpl_finance as finance
df_ohlc = aapl['Close'].resample('10D').ohlc()
df_vol = aapl["Volume"].resample("10D").sum()
df_ohlc.reset_index(inplace=True)
df_ohlc["Date"]=df_ohlc["Date"].map(mdates.date2num)
fig=plt.figure(figsize=(15,6))
ax1=plt.subplot2grid((8,1),(0,0),rowspan=6,colspan=1)
ax2=plt.subplot2grid((8,1),(7,0),rowspan=2,colspan=1,sharex=ax1)
ax1.xaxis_date()
finance.candlestick_ohlc(ax1,df_ohlc.values,width=2,colorup='g')
ax2.fill_between(df_vol.index.map(mdates.date2num),df_vol.values,0)
<matplotlib.collections.PolyCollection at 0x7f79b147d290>

import plotly.graph_objects as go
Interactive CandleStick Plot
fig = go.Figure(data=[go.Candlestick(x=aapl.index,
open=aapl['Open'], high=aapl['High'],
low=aapl['Low'], close=aapl['Close'])
])
fig.update_layout(xaxis_rangeslider_visible=False)
fig.show()
fig.write_html("Candlestick1.html")
#with X_axis Slider
fig = go.Figure(data=[go.Candlestick(x=aapl.index,
open=aapl['Open'], high=aapl['High'],
low=aapl['Low'], close=aapl['Close'])
])
fig.update_layout(
title='Candlestick',
yaxis_title='AAPL Stock',
shapes = [dict(
x0='2016-12-09', x1='2016-12-09', y0=0, y1=1, xref='x', yref='paper',
line_width=2)],
annotations=[dict(
x='2016-12-09', y=0.05, xref='x', yref='paper',
showarrow=False, xanchor='left', text='Increase Period Begins')]
)
fig.show()
fig.write_html("Candlestick2.html")
Forecasting
ARIMA
from statsmodels.tsa.stattools import adfuller
# Store in a function for later use!
def adf_check(time_series):
    """
    Print formatted adfuller test results.
    Parameters
    time_series: the time series to test
    """
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']
    for value,label in zip(result,labels):
        print(label+' : '+str(value))
    if result[1] <= 0.05:
        print("strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary")
    else:
        print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary")
start = pd.to_datetime('2015-02-02')
end = pd.to_datetime('2020-01-31')
aapl = web.DataReader('AAPL','yahoo',start,end)
df=pd.DataFrame(aapl.Close)
df=df.resample(rule="B").mean()
# fill missing business days with the mean of the neighbouring days
for i in range(len(df)):
    if np.isnan(df["Close"][i]):
        df["Close"][i]=np.mean((df['Close'][i+1],df['Close'][i-1]))
adf_check(df['Close'])
Augmented Dickey-Fuller Test:
ADF Test Statistic : 1.3835998531065719
p-value : 0.9970383501642863
#Lags Used : 7
Number of Observations Used : 1297
weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary
df['First Difference'] = df['Close'] - df['Close'].shift(1)
df['Second Difference'] = df['First Difference'] - df['First Difference'].shift(1)
adf_check(df['First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -12.24313277019121
p-value : 9.920782061012239e-23
#Lags Used : 6
Number of Observations Used : 1297
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
adf_check(df['Second Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -13.756009401197572
p-value : 1.029332939728576e-25
#Lags Used : 20
Number of Observations Used : 1282
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
df['Seasonal Diffrence']=df["Close"]-df["Close"].shift(12)
adf_check(df['Seasonal Diffrence'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -6.640670360647956
p-value : 5.420146475942849e-09
#Lags Used : 22
Number of Observations Used : 1270
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
For our model we are going to use first differencing and the seasonal difference.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
fig=plot_acf(df['First Difference'].dropna())

fig=plot_pacf(df["First Difference"].dropna())

fig=plot_acf(df['Second Difference'].dropna())

fig=plot_pacf(df["Second Difference"].dropna())

fig=plot_acf(df['Seasonal Diffrence'].dropna())

fig=plot_pacf(df["Seasonal Diffrence"].dropna())

We are going to use AR(1), I(1) and MA(1) for the non-seasonal ARIMA. For the seasonal ARIMA we set the seasonal period to 24 with AR(1), I(1) and MA(1).
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
model_ns=ARIMA(df["Close"],(1,1,1),dates=df.index,freq='B')
results_ns = model_ns.fit()
print(results_ns.summary())
ARIMA Model Results
==============================================================================
Dep. Variable: D.Close No. Observations: 1304
Model: ARIMA(1, 1, 1) Log Likelihood -3038.936
Method: css-mle S.D. of innovations 2.488
Date: Sun, 02 Feb 2020 AIC 6085.873
Time: 11:26:37 BIC 6106.566
Sample: 02-03-2015 HQIC 6093.636
- 01-31-2020
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
const 0.1461 0.070 2.078 0.038 0.008 0.284
ar.L1.D.Close -0.8300 0.106 -7.833 0.000 -1.038 -0.622
ma.L1.D.Close 0.8672 0.095 9.175 0.000 0.682 1.052
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -1.2048 +0.0000j 1.2048 0.5000
MA.1 -1.1531 +0.0000j 1.1531 0.5000
-----------------------------------------------------------------------------
results_ns.resid.plot(figsize=(10,5))
<matplotlib.axes._subplots.AxesSubplot at 0x7f79a06e0990>

results_ns.resid.plot(kind="kde")
<matplotlib.axes._subplots.AxesSubplot at 0x7f79ba0f6b10>

model = sm.tsa.statespace.SARIMAX(df['Close'],order=(0,1,0), seasonal_order=(1,1,1,24),dates=df.index,freq='B')
results = model.fit()
print(results.summary())
Statespace Model Results
==========================================================================================
Dep. Variable: Close No. Observations: 1305
Model: SARIMAX(0, 1, 0)x(1, 1, 1, 24) Log Likelihood -3036.476
Date: Sun, 02 Feb 2020 AIC 6078.952
Time: 11:26:48 BIC 6094.416
Sample: 02-02-2015 HQIC 6084.759
- 01-31-2020
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.S.L24 0.0092 0.028 0.333 0.739 -0.045 0.064
ma.S.L24 -0.9852 0.029 -34.200 0.000 -1.042 -0.929
sigma2 6.3278 0.178 35.599 0.000 5.979 6.676
===================================================================================
Ljung-Box (Q): 58.50 Jarque-Bera (JB): 1597.59
Prob(Q): 0.03 Prob(JB): 0.00
Heteroskedasticity (H): 4.22 Skew: -0.64
Prob(H) (two-sided): 0.00 Kurtosis: 8.32
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
results.resid.plot(figsize=(10,5))
<matplotlib.axes._subplots.AxesSubplot at 0x7f79a0066210>

results.resid.plot(kind="kde")
<matplotlib.axes._subplots.AxesSubplot at 0x7f798bbd0e50>

Here the residual error oscillates between -20 and 20, so the performance is poor, and if we plot the predictions against the actual data we can see that the predictions fail to follow the actual seasonality.
df['Predict'] = results.predict(start = 1240, end= 1305, dynamic= True)
df[['Close','Predict']][1240:1305].plot(figsize=(6,4));
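Rather than judging the forecast only by eye, a numeric error such as RMSE can be computed over the prediction window. A minimal sketch (the `rmse` helper is illustrative; in the notebook it would be called on `df['Close'][1240:1305]` and `df['Predict'][1240:1305]`):

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error, ignoring any NaN predictions."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mask = ~np.isnan(predicted)
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

print(rmse([100, 101, 102], [100, 103, 101]))  # sqrt(5/3) ≈ 1.291
```

The NaN mask matters because `results.predict` leaves the unpredicted part of the column as NaN after the join back into `df`.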

SVR
import pandas as pd
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
aapl.head()
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2015-02-02 | 119.169998 | 116.080002 | 118.050003 | 118.629997 | 62739100.0 | 108.999298 |
| 2015-02-03 | 119.089996 | 117.610001 | 118.500000 | 118.650002 | 51915700.0 | 109.017654 |
| 2015-02-04 | 120.510002 | 118.309998 | 118.500000 | 119.559998 | 70149700.0 | 109.853806 |
| 2015-02-05 | 120.230003 | 119.250000 | 120.019997 | 119.940002 | 42246200.0 | 110.637886 |
| 2015-02-06 | 120.250000 | 118.449997 | 120.019997 | 118.930000 | 43706600.0 | 109.706200 |
dates=aapl.index[10:-40]
data=np.array(aapl["Close"])
y=data[10:-9]
x=np.zeros((len(y),10))
for i in range(10,len(y)+10):
    x[i-10]=data[i-10:i]
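The loop above builds a sliding-window design matrix: row j holds the 10 closes preceding the target for day j. The same construction can be written without an explicit loop using numpy's `sliding_window_view` (a sketch; `make_windows` is an illustrative name):

```python
import numpy as np

def make_windows(data, window):
    """Return (X, y): each X row holds `window` consecutive values,
    y is the value immediately after each window."""
    data = np.asarray(data, float)
    X = np.lib.stride_tricks.sliding_window_view(data, window)[:-1]
    y = data[window:]
    return X, y

X, y = make_windows(np.arange(15), 10)
print(X.shape, y.shape)  # (5, 10) (5,)
print(X[0], y[0])        # first window is days 0..9; its target is 10.0
```

`sliding_window_view` requires numpy >= 1.20; the `[:-1]` drops the final window, which has no following value to serve as a target.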
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = x[:int(len(x)*0.75)], x[int(len(x)*0.75):],y[:int(len(x)*0.75)],y[int(len(x)*0.75):]
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1,verbose=True)
svr_rbf.fit(X_train,y_train)
[LibSVM]
SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=True)
svr_rbf.score(X_test,y_test)
-2.526752379015568
The R² score here is negative, meaning the model performs worse than simply predicting the mean close; a simple kernel model applied to raw, unscaled prices does poorly at stock price prediction.
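One likely culprit: SVR with an RBF kernel is sensitive to feature scale, and raw prices around 100-300 put the squared distances far outside the useful range for `gamma=0.1`. Standardizing the inputs in a pipeline typically helps; a minimal sketch on synthetic windows (`scaled_svr` and the synthetic data are illustrative, not the notebook's actual result):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic "price" series -> 10 lagged values predict the next value
prices = np.cumsum(rng.normal(0.1, 1.0, 400)) + 120
X = np.array([prices[i:i + 10] for i in range(len(prices) - 10)])
y = prices[10:]
split = int(0.75 * len(X))

# StandardScaler rescales each lag feature before the RBF kernel sees it
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, gamma=0.1))
scaled_svr.fit(X[:split], y[:split])
print(scaled_svr.score(X[split:], y[split:]))
```

Scaling the target (or predicting returns instead of price levels) is another common variant of the same fix.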
ANN
from sklearn.neural_network import MLPRegressor
model=MLPRegressor((50,100,50),batch_size=16,learning_rate='adaptive',early_stopping=True,warm_start=True,verbose=True)
model.fit(X_train,y_train)
Iteration 87, loss = 12.93616072
Validation score: 0.995055
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
MLPRegressor(activation='relu', alpha=0.0001, batch_size=16, beta_1=0.9,
beta_2=0.999, early_stopping=True, epsilon=1e-08,
hidden_layer_sizes=(50, 100, 50), learning_rate='adaptive',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=True, warm_start=True)
model.score(X_test,y_test)
0.9868282389698162
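A test R² of ~0.99 on one-step-ahead price prediction should be read with care: daily closes are strongly autocorrelated, so a naive persistence baseline ("tomorrow's close = today's close") also scores very well. Comparing against that baseline shows how much the model actually adds; a sketch with an illustrative `r2_score` helper and a synthetic series:

```python
import numpy as np

def r2_score(actual, predicted):
    """Coefficient of determination, computed directly."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
close = np.cumsum(rng.normal(0.1, 1.0, 300)) + 120  # synthetic "close" series
baseline_pred = close[:-1]                          # persistence baseline
print(r2_score(close[1:], baseline_pred))
```

If the trained network does not clearly beat this number, most of its apparent accuracy is just the series' own inertia.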
ann_df=pd.DataFrame()
ann_df['Data']=y_test
ann_df["ANN Predict"]=model.predict(X_test)
ann_df.index=dates[-len(y_test):]
fig=go.Figure()
fig.add_trace(go.Scatter(x=ann_df.index, y=ann_df['Data'], mode='lines',
name="Data",
line=dict(color='rgb(115,115,115)', width=3),
connectgaps=True,
))
fig.add_trace(go.Scatter(x=ann_df.index, y=ann_df['ANN Predict'], mode='lines',
name="Predict",
line=dict(color='rgb(49,130,189)', width=2),
connectgaps=True,
))
fig.show()
fig.write_html("ANN.html")
loss=model.loss_curve_
num_iter=range(model.n_iter_)
val_score=model.validation_scores_
plt.figure()
plt.plot(num_iter,loss,label="Loss")
plt.legend()
<matplotlib.legend.Legend at 0x7f798abcd350>

plt.figure()
plt.plot(num_iter,val_score,label="Score")
plt.legend()
<matplotlib.legend.Legend at 0x7f798ab28fd0>

LSTM
from keras.models import Sequential
from keras.layers import Dense, LSTM
import keras
import keras.optimizers as optim
Using TensorFlow backend.
X_train=X_train.reshape((X_train.shape[0],X_train.shape[1],1))
start = pd.to_datetime('2009-02-02')
end = pd.to_datetime('2020-01-31')
import pandas_datareader.data as web  # re-import so this cell also runs standalone
aapl = web.DataReader("aapl","yahoo",start,end)
data = aapl.Close.values
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(data.reshape(-1,1))
data= scaled_data.squeeze()
y=data[60:-40]
x=np.zeros((len(y),60))
for i in range(60,len(y)+60):
    x[i-60]=data[i-60:i]
X_train, X_test, y_train, y_test = x[:int(len(x)*0.75)], x[int(len(x)*0.75):],y[:int(len(x)*0.75)],y[int(len(x)*0.75):]
X_train=X_train.reshape((X_train.shape[0],X_train.shape[1],1))
X_test=X_test.reshape((X_test.shape[0],X_test.shape[1],1))
X_train.shape
(2001, 60, 1)
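One caveat in the preprocessing above: the MinMaxScaler is fit on the full series before the train/test split, so information about the test period's price range leaks into training. The safer pattern is to fit the scaler on the training portion only and reuse its statistics on the test portion; a sketch (`series` and `split` are illustrative stand-ins):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

series = np.linspace(100.0, 200.0, 100)   # stand-in for aapl.Close.values
split = int(0.75 * len(series))

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(series[:split].reshape(-1, 1))  # fit on train only
test_scaled = scaler.transform(series[split:].reshape(-1, 1))       # reuse train stats

print(train_scaled.max(), test_scaled.max())  # test values can exceed 1.0
```

Test-set values above 1.0 are expected here; unlike leakage, that is an honest reflection of prices moving outside the range the model was trained on.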
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape= (X_train.shape[1], 1)))
model.add(LSTM(50, return_sequences= False))
model.add(Dense(25))
model.add(Dense(1))
model.compile(optimizer=optim.Adam(learning_rate=0.001), loss='mean_squared_error')
history=model.fit(X_train, y_train, batch_size=1, epochs=5,validation_split=0.1)
Train on 1800 samples, validate on 201 samples
Epoch 1/5
1800/1800 [==============================] - 48s 27ms/step - loss: 1.0387e-04 - val_loss: 3.7598e-05
Epoch 2/5
1800/1800 [==============================] - 49s 27ms/step - loss: 6.9256e-05 - val_loss: 3.0741e-05
Epoch 3/5
1800/1800 [==============================] - 49s 27ms/step - loss: 6.6532e-05 - val_loss: 2.2441e-05
Epoch 4/5
1800/1800 [==============================] - 49s 27ms/step - loss: 5.1043e-05 - val_loss: 3.6214e-04
Epoch 5/5
1800/1800 [==============================] - 51s 28ms/step - loss: 5.4891e-05 - val_loss: 2.5737e-05
actual_data = aapl.Close.values[60:-40]
y_test = actual_data[int(len(x)*0.75):]
df=pd.DataFrame()
X_test.shape, X_train.shape
((668, 60, 1), (2001, 60, 1))
df["predict"]=scaler.inverse_transform(model.predict(X_test)).reshape(len(y_test))
df['data']=y_test
df.index=dates[-len(y_test):]
fig=go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=df['data'], mode='lines',
name="Data",
line=dict(color='rgb(115,115,115)', width=3),
connectgaps=True,
))
fig.add_trace(go.Scatter(x=df.index, y=df['predict'], mode='lines',
name="Predict",
line=dict(color='rgb(49,130,189)', width=2),
connectgaps=True,
))
fig.show()
fig.write_html("LSTM.html")
history.history
{'val_loss': [0.00010109769987061885, 3.839707523954579e-05],
'loss': [0.00028365603561529015, 0.00012780741379185667]}